It is increasingly common to encounter time-varying random fields on networks (metabolic networks, sensor arrays,
distributed computing, etc.). This paper considers the problem of optimal, nonlinear prediction of these fields,
showing from an information-theoretic perspective that it is formally identical to the problem of finding minimal
local sufficient statistics. I derive general properties of these statistics, show that they can be composed into
global predictors, and explore their recursive estimation properties. For the special case of discrete-valued
fields, I describe a convergent algorithm to identify the local predictors from empirical data, with minimal prior
information about the field, and no distributional assumptions.

1 Introduction



Within the field of complex systems, most interest in networks has focused either on their structural
properties (Newman, 2003), or on the behavior of known dynamical systems coupled through a network,
especially the question of their synchronization (Pikovsky et al., 2001). Statistical work, ably summarized
in the book by Guyon (1995), has largely (but not exclusively) focused on characterization and inference
for static random fields on networks. In this paper, I consider the problem of predicting the behavior of
a random field, with unknown dynamics, on a network of fixed structure. Adequate prediction involves
reconstructing those unknown dynamics, which are in general nonlinear, stochastic, imperfectly measured,
and significantly affected by the coupling between nodes in the network. Such systems are of interest in
many areas, including biochemistry (Bower and Bolouri, 2001), sociology (Young, 1998), neuroscience
(Dayan and Abbott, 2001), decentralized control (Sijlak, 1990; Mutambara, 1998) and distributed sensor
systems (Varshney, 1997). Cellular automata (Ilachinski, 2001) are a special case of such systems, where
the graph is a regular lattice.

There are two obvious approaches to this problem of predicting network dynamics, which is essentially 
a problem of system identification. One is to infer a global predictor, treating entire field configurations as 
measurements from a time series; this is hugely impractical, if only because on a large enough network, 
no configuration ever repeats in a reasonable-sized data sample. The other straightforward approach is 
to construct a distinct predictor for each node in the network, treating them as so many isolated time- 
series. While more feasible, this misses any effects due to the coupling between nodes, which is a serious 
drawback. In these contexts, we often know very little about the causal architecture of the systems we are 
studying, but one of the things we do know is that the links in the network are causally relevant. In fact, 
it is often precisely the couplings which interest us. However, node-by-node modeling ignores the effects 
of the couplings, which then show up as increased forecast uncertainty at best, and systematic biases at 
worst. We cannot guarantee optimal prediction unless these couplings are taken into account. (See the
end of Section 3.1.) I will construct a distinct predictor for each node in the network, but these predictors
will explicitly take into account the couplings between nodes, and use them as elements in their forecasts.

By adapting tools from information theory, I construct optimal, nonlinear local statistical predictors for 
random fields on networks; these take the form of minimal sufficient statistics. I describe some of the 
optimality properties of these predictors, and show that the local predictors can be composed into global 
predictors of the field's evolution. Reconstructing the dynamics inevitably involves reconstructing the 
underlying state space, and raises the question of determining the state from observation, i.e., the problem 
of filtering or state estimation. There is a natural translation from the local predictors to a filter which
estimates the associated states, and is often able to do so with no error. I establish that it is possible to 
make this filter recursive without loss of accuracy, and show that the filtered field has some nice Markov 
properties. In the special case of a discrete field, I give an algorithm for approximating the optimal 
predictor from empirical data, and prove its convergence. 

Throughout, my presentation will be theoretical and abstract; for the most part, I will not deal with 
practical issues of implementing the method. In particular, I will slight the important question of how
much data is needed for reliable prediction. However, the method has been successfully applied to fields
on regular lattices (see Section || below).

I will make extensive use of (fairly basic) information theory and properties of conditional indepen- 
dence relations, and accordingly will assume some familiarity with conditional measures. I will not pay 
attention to measure-theoretic niceties, and shall assume that the random field is sufficiently regular that 
all the conditional measures I invoke actually exist and are conditional probability distributions. Readers, 
for their part, may assume all functions to be measurable functions. 

The next section of the paper establishes the basic setting, notation, and preliminary results, the latter 
mostly taken without proof from standard references on information theory and conditional independence. 
Section 3 constructs the local predictors and establishes their main properties. Section 4 establishes the
results about transitions between states which are related to recursive filtering. Section || discusses an
algorithm for identifying the states from empirical data on discrete fields.



2 Notation and Preliminaries 

2.1 The Graph and the Random Field X((v,t))

Consider a fixed graph, consisting of a set of nodes or vertices V, and undirected edges between the nodes
E. An edge is an ordered pair (v_1, v_2), indicating that it is possible to go directly from v_1 to v_2; the set of
edges is thus a binary relation on the nodes. We indicate the presence of this relation, i.e. of a direct path,
by v_1 E v_2; since the edges are undirected, this implies that v_2 E v_1. There is a path of length k between two
nodes if v_1 E^k v_2, where E^k is the k-fold composition of E. In addition to the graph, we have a time
coordinate which proceeds in integer ticks: t ∈ N or t ∈ Z. We need a way to refer to the combination of a
vertex and a time; I will call this a point, and write it using the ordered pair (v,t).

At each point, we have a random variable X((v,t)), taking values in the set 𝒜, a "standard alphabet"
(Gray, 1990), such as a set of discrete symbols or a finite-dimensional Euclidean vector space; this is
the random field. Let c be the maximum speed of propagation of disturbances in the field. Now define
the past light-cone of the point (v,t) as the set of all points where the field could influence the field at
(v,t) (including (v,t)); likewise the future light-cone is all the points whose fields could be influenced by
the field at (v,t) (excluding (v,t)). We will mostly be concerned with the configurations in these
light-cones, rather than the sets of points themselves; the configurations in the past and future light-cone
are written L^-((v,t)) and L^+((v,t)), respectively.

Fig. 1: Schematic of the light-cones of node (v,t). Time runs vertically downward. For visual simplicity, vertices
are drawn as though the graph was a one-dimensional lattice, but no such assumption is necessary. The gray circles
denote the past light-cone of (v,t), and the white ones its future light-cone. Note that (v,t) is included in its past
light-cone, resulting in a slight asymmetry between the two cones.

Figure 1 is a schematic illustration of the past and future light-cones.* There is a certain distribution over
future light-cone configurations, conditional on the configuration in the past; this is
P(L^+((v,t)) | L^-((v,t)) = l^-), which I shall abbreviate P(L^+ | l^-). Note that, in general, the light-cones
of distinct vertices have completely different shapes, and so neither their configurations nor the distribution
over those configurations are comparable.
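
As a concrete illustration of these definitions, here is a minimal sketch that enumerates the points of a past and a future light-cone on a small graph. It assumes (my formalization, not stated in the text) that "could influence" means graph distance at most c times the elapsed time, and it truncates the cones at a finite `depth`; the function names and the example graph are hypothetical.

```python
from collections import deque

def graph_distances(adj, source):
    """Breadth-first distances from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def past_light_cone(adj, v, t, depth, c=1):
    """Points (u, s) with t - depth <= s <= t and d(u, v) <= c * (t - s).
    Includes (v, t) itself, as in the text."""
    dist = graph_distances(adj, v)
    return {(u, s) for s in range(t - depth, t + 1)
            for u, d in dist.items() if d <= c * (t - s)}

def future_light_cone(adj, v, t, depth, c=1):
    """Points (u, s) with t < s <= t + depth and d(u, v) <= c * (s - t).
    Excludes (v, t) itself."""
    dist = graph_distances(adj, v)
    return {(u, s) for s in range(t + 1, t + depth + 1)
            for u, d in dist.items() if d <= c * (s - t)}

# Hypothetical example: a path graph 0 - 1 - 2 - 3 - 4, cones of the point (2, 0)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
past = past_light_cone(adj, v=2, t=0, depth=2)
future = future_light_cone(adj, v=2, t=0, depth=2)
```

On this example the past cone has one more point than the future cone, which is the asymmetry noted in the caption of Fig. 1.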

2.2 Mutual Information 

The mutual information between two random variables X and Y is 



$$ I[X;Y] = \mathbf{E}_{X,Y}\left[ \log_2 \frac{P(X=x, Y=y)}{P(X=x)\,P(Y=y)} \right] \qquad (3) $$



where E is mathematical expectation, and P is the probability mass function for discrete variables, and the
probability density function for continuous ones. That is, it is the expected logarithm of the ratio between
the actual joint distribution of X and Y, and the product of their marginal distributions. I[X;Y] ≥ 0, and
I[X;Y] = 0 if and only if X and Y are independent.

The conditional mutual information, I[X;Y|Z], is

$$ I[X;Y \mid Z] = \mathbf{E}_{X,Y,Z}\left[ \log_2 \frac{P(X=x, Y=y \mid Z=z)}{P(X=x \mid Z=z)\,P(Y=y \mid Z=z)} \right] \qquad (4) $$



Conditional mutual information is also non-negative. 
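
For discrete variables, both quantities reduce to finite sums over the joint probability mass function. The following is a minimal sketch under that assumption; the dictionary representation and the function names are mine, chosen for illustration only.

```python
from math import log2
from collections import defaultdict

def mutual_information(joint):
    """I[X;Y] in bits, from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def conditional_mutual_information(joint):
    """I[X;Y|Z] in bits, from a dict {(x, y, z): probability}."""
    pz, pxz, pyz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in joint.items():
        pz[z] += p
        pxz[(x, z)] += p
        pyz[(y, z)] += p
    # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
    return sum(p * log2(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in joint.items() if p > 0)

# Two independent fair bits: zero mutual information.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y is a copy of a fair bit X: one bit of mutual information.
copied = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y independent fair bits, Z = X XOR Y: dependent once Z is known.
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
```

The XOR case is a reminder that variables which are independent marginally can become dependent after conditioning on a third variable.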

We will need only two other properties of mutual information. The first is called the "data-processing
inequality". For any function f,

$$ I[f(X);Y] \le I[X;Y] \qquad (5) $$

and

$$ I[f(X);Y \mid Z] \le I[X;Y \mid Z]. \qquad (6) $$

* The term "light-cone" is used here in analogy with relativistic physics, but it's just an analogy.

The other is called the "chain rule":

$$ I[X,Z;Y] = I[X;Y] + I[Z;Y \mid X] \qquad (7) $$

For more on the nature and uses of mutual information, and proofs of Equations ||–7, see any text on
information theory, e.g. Gray (1990).

2.3 Conditional Independence 

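Two random variables, X and Y, are conditionally independent given a third, T, in symbols X ⫫ Y | T, when

$$ P(X,Y \mid T) = P(X \mid T)\,P(Y \mid T). \qquad (8) $$

X ⫫ Y | T if and only if

$$ P(Y \mid X,T) = P(Y \mid T) \qquad (9) $$

and

$$ I[X;Y \mid T] = 0. \qquad (10) $$

Conditional independence has many implicative properties; we will use some of the standard ones,
proofs of which may be found in any book on graphical models (Lauritzen, 1996; Pearl, 2000; Spirtes
et al., 2001).

$$ (A \perp\!\!\!\perp B \mid CD) \wedge (A \perp\!\!\!\perp D \mid CB) \;\Rightarrow\; (A \perp\!\!\!\perp BD \mid C) \qquad (11) $$
$$ (A \perp\!\!\!\perp BC \mid D) \;\Rightarrow\; (A \perp\!\!\!\perp B \mid CD) \qquad (12) $$
$$ (A \perp\!\!\!\perp B \mid C) \wedge (A \perp\!\!\!\perp D \mid CB) \;\Rightarrow\; (A \perp\!\!\!\perp BD \mid C) \qquad (13) $$

Sadly,

$$ (A \perp\!\!\!\perp B) \;\not\Rightarrow\; (A \perp\!\!\!\perp B \mid C) \qquad (14) $$
$$ (A \perp\!\!\!\perp B \mid C) \;\not\Rightarrow\; (A \perp\!\!\!\perp B \mid CD) \qquad (15) $$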

The following property, while not included in most lists of conditional independence properties, is of
some use to us:

$$ (A \perp\!\!\!\perp B \mid C) \;\Rightarrow\; (A \perp\!\!\!\perp f(B) \mid C) \qquad (16) $$

for any nonrandom function f. It is a direct consequence of the non-negativity of conditional mutual
information, and Equations 6 and 10.
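
Spelled out, the argument is a short sandwich. The sketch below assumes that the two equations invoked are the conditional data-processing inequality and the mutual-information characterization of conditional independence, and it uses the symmetry of mutual information in its two arguments.

```latex
\begin{align*}
A \perp\!\!\!\perp B \mid C
  &\;\Longrightarrow\; I[A;B \mid C] = 0
  && \text{(characterization of conditional independence)} \\
0 \;\le\; I[A;f(B) \mid C] = I[f(B);A \mid C]
  &\;\le\; I[B;A \mid C] = I[A;B \mid C] = 0
  && \text{(non-negativity, data processing)} \\
&\;\Longrightarrow\; I[A;f(B) \mid C] = 0
  \;\Longrightarrow\; A \perp\!\!\!\perp f(B) \mid C .
\end{align*}
```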

2.4 Predictive Sufficient Statistics 

Any function on the past light-cone defines a local statistic, offering some summary of the influence of
all the points which could affect what happens at (v,t). A good local statistic conveys something about
"what comes next", i.e., L^+. We can quantify this using information theory.

To be slightly more general, suppose we know the random variable X and wish to predict Y. Any
function f(X) defines a new random variable F = f(X) which is a statistic. Because F is a function of X,
the data processing inequality (Equation 5) says I[Y;F] ≤ I[Y;X].

Definition 1 (Sufficient Statistic) A statistic F = f(X) is sufficient for Y when I[Y;F] = I[Y;X].

A sufficient statistic, that is to say, is one which is as informative as the original data. While I will not
elaborate on this here, it is very important to remember that any prediction method which uses a
non-sufficient statistic can be bettered by one which does. No matter what the loss function for prediction,
the optimal predictor can always be implemented by an algorithm which depends solely on a sufficient
statistic (Blackwell and Girshick, 1954).



Here are two important criteria, and consequences, of sufficiency.

Proposition 1 F is a sufficient statistic, as in Definition 1, if and only if, for all x,

$$ P(Y \mid X=x) = P(Y \mid F=f(x)). \qquad (17) $$

See Kullback (1968, Sections 2.4 and 2.5). (Some authors prefer to use this as the definition of sufficiency.)
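
A small numerical check may make the criterion concrete. In the sketch below, which uses a hypothetical toy model of my own, X is uniform on {0,1,2,3} and Y depends on X only through its parity. The parity statistic then retains all the information X carries about Y, I[Y;F] = I[Y;X], while a different two-valued statistic (x // 2) retains none of it.

```python
from math import log2
from collections import defaultdict

def mutual_information(joint):
    """I[X;Y] in bits, from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def push_forward(joint, f):
    """Joint distribution of (f(X), Y) computed from the joint of (X, Y)."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(f(x), y)] += p
    return dict(out)

# Hypothetical model: P(Y=1 | X=x) depends only on the parity of x.
p_y1_given_parity = {0: 0.2, 1: 0.9}
joint_xy = {}
for x in range(4):
    q = p_y1_given_parity[x % 2]
    joint_xy[(x, 1)] = 0.25 * q
    joint_xy[(x, 0)] = 0.25 * (1 - q)

i_xy = mutual_information(joint_xy)
# Parity is sufficient: it keeps all the information about Y.
i_parity = mutual_information(push_forward(joint_xy, lambda x: x % 2))
# x // 2 mixes one even and one odd value in each class, so it is not sufficient.
i_coarse = mutual_information(push_forward(joint_xy, lambda x: x // 2))
```

Both statistics compress X to a single bit, so sufficiency is a matter of which distinctions are kept, not how many.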

Lemma 1 F is a sufficient statistic if and only if X and Y are conditionally independent given F, i.e.,
X ⫫ Y | F.

Proof. "Only if": begin with the observation that

$$ P(Y \mid X, f(X)) = P(Y \mid X), \qquad (18) $$

no matter what the function.§ Hence P(Y|X,F) = P(Y|X). But if Y ⫫ X | F, then P(Y|X,F) = P(Y|F).
Hence P(Y|X) = P(Y|F). "If": start with P(Y|F) = P(Y|X). As before, by Equation 18, since F = f(X),
P(Y|X) = P(Y|X,F). Hence P(Y|F) = P(Y|X,F), so (Eq. 9), X ⫫ Y | F. □
While all sufficient statistics have the same predictive power, they are not, in general, equal in other 
respects. In particular, some of them make finer distinctions among past light-cones than others. Since 
these are distinctions without a difference, we might as well, in the interests of economy, eliminate them 
when we find them. The concept of a minimal sufficient statistic captures the idea of eliminating all the 
distinctions we can get rid of, without loss of predictive power.

Definition 2 (Minimal Sufficiency) F is a minimal sufficient statistic for predicting Y from X if and only
if it is sufficient, and it is a function of every other sufficient statistic. 



3 Construction and Properties of Optimal Local Predictors 

3.1 Minimal Local Sufficient Statistics

We have observed that for each past light-cone configuration l^- at a point, there is a certain conditional
distribution over future light-cone configurations, P(L^+ | l^-). Let us say that two past configurations, or
"pasts", are equivalent if they have the same conditional distribution,

$$ l_1^- \sim l_2^- \;\Leftrightarrow\; P(L^+ \mid l_1^-) = P(L^+ \mid l_2^-) \qquad (19) $$

Let us write the equivalence class of l^-, that is, the set of all pasts it is equivalent to, as [l^-]. We now
define a local statistic, which is simply the equivalence class of the past light-cone configuration.

§ Let A = σ(X) be the sigma-algebra generated by X, and B = σ(f(X)). Clearly B ⊆ A, so σ(X, f(X)) = A ∨ B = A = σ(X). Thus,
the sigma-algebra involved when we condition on X is the same as when we condition on X and f(X) at once.

Definition 3 (Local Causal State) The local causal state at (v,t), written S((v,t)), is the set of all past
light-cones whose conditional distribution of future light-cones is the same as that of the past light-cone
at (v,t). That is,

$$ \begin{aligned}
S((v,t)) &= \epsilon(l^-) = [l^-] && (20) \\
         &= \{ \lambda \mid P(L^+ \mid \lambda) = P(L^+ \mid l^-) \} && (21)
\end{aligned} $$
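
When the conditional distributions are known exactly and the field is discrete, the equivalence classes can be read off by grouping pasts with identical conditional distributions. The sketch below does only that; it is not the estimation algorithm for empirical data discussed later in the paper, and the function name, the rounding tolerance, and the example distributions are all hypothetical.

```python
def causal_states(cond_dist, ndigits=9):
    """Group pasts whose conditional future distributions coincide.

    cond_dist: {past: {future: probability}}.
    Returns {past: state_label}; equal labels mean equivalent pasts.
    Probabilities are rounded to `ndigits` so floating-point noise does not
    split a class (a practical assumption, not part of the definition).
    """
    labels, epsilon = {}, {}
    for past, dist in cond_dist.items():
        key = frozenset((fut, round(p, ndigits))
                        for fut, p in dist.items() if round(p, ndigits) > 0)
        # Reuse the label of an already-seen distribution, else mint a new one.
        epsilon[past] = labels.setdefault(key, len(labels))
    return epsilon

# Hypothetical example: binary pasts of length 2; the future symbol's
# distribution depends only on the most recent past symbol.
cond_dist = {
    (0, 0): {0: 0.7, 1: 0.3},
    (1, 0): {0: 0.7, 1: 0.3},
    (0, 1): {0: 0.1, 1: 0.9},
    (1, 1): {0: 0.1, 1: 0.9},
}
epsilon = causal_states(cond_dist)
```

Here four pasts collapse into two states, one for each value of the most recent symbol.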



The name "causal state" was introduced for the analogous construction for time series by Crutchfield
and Young (1989). I will continue to use this term here, for two reasons. First, without getting into the
vexed questions surrounding the nature of causality and the properties of causal relations, it is clear that the
causal states have at least the "screening" properties (Salmon, 1984) all authorities on causality require,
and may well have the full set of counter-factual or "intervention" properties demanded by Pearl (2000)
and Spirtes et al. (2001). (When and whether they meet the stricter criteria is currently an open question.)
Second, and decisively, these objects need a name, and the previous terms in the literature, like
"action-test pairs in a predictive state representation" (Littman et al., 2002), "elements of the statistical
relevance basis" (Salmon, 1984), or "states of the prediction process" (Knight, 1975), are just too awkward to use.



Theorem 1 (Sufficiency of Local Causal States) The local causal states are sufficient. 

Proof. First, we will find the distribution of future light-cone configurations conditional on the causal
state. Clearly, this is the average over the distributions conditional on the past light-cones contained in the
state. (See Loeve (1955, Section 25.2) for details.) Thus,

$$ P(L^+ \mid S = s) = \int_{\lambda \in \epsilon^{-1}(s)} P(L^+ \mid L^- = \lambda)\, P(L^- = \lambda \mid S = s)\, d\lambda . \qquad (22) $$

By construction, P(L^+ | L^- = λ) is the same for all λ in the domain of integration, so, picking out an
arbitrary representative element l^-, we pull that factor out of the integral,

$$ P(L^+ \mid S = s) = P(L^+ \mid L^- = l^-) \int_{\lambda \in \epsilon^{-1}(s)} P(L^- = \lambda \mid S = s)\, d\lambda . \qquad (23) $$

But now the integral is clearly 1, so, for all l^-,

$$ P(L^+ \mid S = \epsilon(l^-)) = P(L^+ \mid L^- = l^-). \qquad (24) $$

And so, by Proposition 1, S is sufficient. □

Corollary 1 The past and future light-cones are independent given the local causal state.

Proof. Follows immediately from the conclusion of the theorem and Lemma 1. □



Corollary 2 Let K be the configuration in any part of the future light-cone. Then, for any local statistic
R,

$$ I[K;R] \le I[K;S]. \qquad (25) $$

This corollary is sometimes useful because I[L^+;R] can be infinite for all non-trivial statistics.

Theorem 2 (Minimality of Local Causal States) The local causal state is a minimal sufficient statistic.

Proof. We need to show that, for any other sufficient statistic R = η(L^-), there is a function h such
that S = h(R). From Proposition 1, P(L^+ | R = η(l^-)) = P(L^+ | l^-). It follows that η(l_1^-) = η(l_2^-) only if
P(L^+ | l_1^-) = P(L^+ | l_2^-). This in turn implies that l_1^- ~ l_2^-, and so ε(l_1^-) = ε(l_2^-). Thus, all histories with a
common value of η(l^-) also have the same value of ε(l^-), and one can determine S from R. Hence the
required function exists. □

Corollary 3 (Uniqueness of the Local Causal States) If R = η(L^-) is a minimal local sufficient statistic,
then η and ε define the same partition of the data.

Proof. Since S is a minimal statistic, it is a function of R, and for some function h, S = h(R). But
since R is also minimal, there is a function g such that R = g(S). Hence ε(l_1^-) = ε(l_2^-) if and only if
η(l_1^-) = η(l_2^-). □

In the introduction, I claimed that if we tried to predict the future of the network by building a separate 
model for each vertex, taken in isolation, the results would generally be sub-optimal. Theorem 2 and
Corollary |] vindicate this claim. The only way predictions based on the vertex-by-vertex procedure could
be as good as ones based on the full light-cone is if all of the rest of the light-cone was always irrelevant. 
This would mean that there was, effectively, no coupling whatsoever between the vertices. 

3.2 Statistical Complexity of the Field 

Much thought has gone into the problem of defining a measure of complexity that is not simply a measure
of randomness, as are Shannon entropy and the algorithmic information (Badii and Politi, 1997). Perhaps
the best suggestion is the one which seems to have originated with Grassberger (1986), that the complexity
of a system is the minimal amount of information about its state needed for optimal prediction. This
suggests, following Crutchfield and Young (1989), that we identify the complexity of the system with the
amount of information needed to specify its causal state. Crutchfield and Young called this quantity the
statistical complexity.

In the case of random fields, it is more appropriate to look at a local, point-by-point version of this
quantity. That is, the local statistical complexity is

$$ C((v,t)) = I[S((v,t)); L^-((v,t))]. \qquad (26) $$

Dropping the argument, we see that because of Equation 5 and the fact that S is a minimal statistic,

$$ I[R;L^-] \ge C \qquad (27) $$

for any other sufficient statistic R. In fact, one can start with Equations 25 and 27, maximal predictive
power and minimal complexity, as axioms, and from them derive all the properties of the local causal
states (Shalizi and Crutchfield, 2001).
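
For a discrete field the local statistical complexity is easy to compute once the states are known: since a statistic R = η(L^-) is a deterministic function of the past, I[R;L^-] equals the Shannon entropy of R. The sketch below uses that identity; the function name and the example distribution over pasts are hypothetical.

```python
from math import log2
from collections import defaultdict

def information_about_past(p_past, statistic):
    """I[R; L^-] in bits for a discrete statistic R = statistic[past].

    Because R is a deterministic function of the past, I[R; L^-] = H[R],
    the Shannon entropy of the induced distribution over values of R.
    """
    p_value = defaultdict(float)
    for past, p in p_past.items():
        p_value[statistic[past]] += p
    return -sum(p * log2(p) for p in p_value.values() if p > 0)

# Hypothetical inputs: four equiprobable pasts grouped into two causal states.
p_past = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
epsilon = {(0, 0): "a", (1, 0): "a", (0, 1): "b", (1, 1): "b"}
# The past itself is also a sufficient statistic, but a more expensive one.
identity = {past: past for past in p_past}

c_causal = information_about_past(p_past, epsilon)     # statistical complexity C
c_identity = information_about_past(p_past, identity)  # I[L^-; L^-] = H[L^-]
```

In this example the causal states cost one bit while the full past costs two, in line with the inequality for other sufficient statistics.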



A useful property of the statistical complexity is that it provides an upper bound on the predictive
information.

$$ I[L^+;L^-] \le I[L^-;S] \qquad (28) $$

Proof. Use the chain rule for information, Equation 7.

$$ \begin{aligned}
I[S,L^+;L^-] &= I[S;L^-] + I[L^+;L^- \mid S] && (29) \\
             &= I[S;L^-], && (30)
\end{aligned} $$

since Corollary 1 and Equation 10 imply I[L^+;L^-|S] = 0. Using the chain rule the other way,

$$ \begin{aligned}
I[S,L^+;L^-] &= I[L^+;L^-] + I[S;L^- \mid L^+] && (31) \\
             &\ge I[L^+;L^-], && (32)
\end{aligned} $$

since I[S;L^-|L^+] ≥ 0.

Shalizi and Crutchfield (2001) give detailed arguments for why C is the right way to measure complexity;
here I will just mention two more recent applications. First, when our random field can be interpreted
as a macroscopic variable which is a coarse-graining of an underlying microscopic system, as in statistical
mechanics, then C is the amount of information the macro-variable contains about the micro-state (Shalizi
and Moore, 2003). Second, the rate of change of C over time provides a quantitative measurement of
self-organization in cellular automata (Shalizi and Shalizi, 2003).



3.3 Composition of Global Predictors from Local Predictors 

We have just seen that, if we are interested in the future of an individual point, we can do optimal predic- 
tion with only a knowledge of its local causal state. One might worry, however, that in compressing the 
past light-cone down to the causal state, we have thrown away information that would be valuable on a 
larger scale, that would help us if we wanted to predict the behavior of multiple vertices. In the limit, if 
we wanted to predict the behavior of the entire network, and had only the local causal states available to 
us, how badly would we be hampered? 

To address these issues, let us turn our thoughts from a single point to a connected set of vertices taken 
at a common time t, or a patch. The patch has a past and future light-cone (P^- and P^+, respectively),
which are just the unions of the cones of its constituent points. Just as we did for points, we can construct 
a causal state for the patch, which we'll call the patch causal state, which has all the properties of the 
local causal states. We now ask, what is the relationship between the local causal states of the points in 
the patch, and the patch causal state? If we try to predict the future of the patch using just the local states, 
are our predictions necessarily impaired in any way? 

The answer, it turns out, is no. 

Lemma 2 (Patch Composition) The causal state of a patch at one time is uniquely determined by the 
composition of all the local causal states within the patch at that time. 

Proof. We will show that the composition of local causal states within the patch is a sufficient statistic 
of the patch, and then apply minimality. 

Consider first a patch consisting of two spatially adjacent points, (u,t) and (v,t). Define the following
variables:

$$ \begin{aligned}
C^- &= L^-((u,t)) \cap L^-((v,t)) \\
U^- &= L^-((u,t)) \setminus C^- \\
V^- &= L^-((v,t)) \setminus C^-
\end{aligned} $$

Thus L^-((u,t)) = U^- ∪ C^-, and likewise for L^-((v,t)). Define U^+, C^+ and V^+ similarly. (Figure 2 gives
a picture of these regions.) Now consider the configurations in these regions. We may draw a diagram of
effects or influence, Figure 3, which should not be confused with the graph of the network.

Fig. 2: The space-time regions for a two-point patch. Points which belong exclusively to the light-cones of the point
on the left ((u,t)) are shaded light grey; those which belong exclusively to the light-cones of the other point ((v,t))
are shaded dark grey. The areas of overlap (C^- and C^+) are white. Note that, by the definition of light-cones, the
configuration in U^- can have no effect on that in V^+ or vice versa.



[Figure: diagram with nodes U^-, C^-, V^-, U^+, C^+, V^+]

Fig. 3: Diagram of effects for the two-node patch. Arrows indicate the direction of influence, from causes to effects; the
absence of an arrow between two nodes indicates an absence of direct influence.



Corollary 1 tells us that every path from U^- or C^- to U^+ must go through S((u,t)). By the very
definition of light-cones, there cannot be a path linking V^- to U^+. Therefore there cannot be a link from
S((v,t)) to U^+. (Such a link would in any case indicate that U^+ had a dependence on C^- which was not
mediated by S((u,t)), which is false.) All of this is true, mutatis mutandis, for V^+ as well.

Now notice that every path from variables in the top row (the variables which collectively constitute
P^-) to the variables in the bottom row (which collectively are P^+) must pass through either
S((u,t)) or S((v,t)). The set Z = {S((u,t)), S((v,t))} thus "blocks" those paths. In the terminology of
graphical models (Pearl, 2000, p. 18), Z d-separates P^- and P^+. But d-separation implies conditional
independence (ibid.). Thus P^- and P^+ are independent given the composition of S((u,t)) and S((v,t)).
But that combination is a function of P^-, so Lemma 1 applies, telling us that the composition of local
states is sufficient. Then Theorem 2 tells us that there is a function from the composition of local states to
the patch causal state.

Now, the reader may verify that this argument would work if one of the two "nodes" above was really 



Prediction of Random Fields on Networks 






itself a patch. That is, if we break a patch down into a single node and a sub-patch, and we know their 
causal states, the causal state of the larger patch is fixed. Hence, by mathematical induction, if we know 
all the local causal states of the nodes within a patch, we have fixed the patch causal state uniquely. □ 

Theorem 3 (Global Prediction) The future of the entire network can be optimally predicted from a 
knowledge of all the local causal states at one time. 

Proof. Sufficiency, as we've said, implies optimal prediction. Let us check that the combination of all 
the local causal states is a sufficient statistic for the global configuration. Apply Lemma 2 to the patch of
the entire lattice. The proof of the lemma goes through, because it in no way depends on the size or the 
shape of the past, or even on the patch being finite in extent. Since the patch causal state for this patch 
is identical with the global causal state, it follows that the latter is uniquely fixed by the composition of 
the local causal states at all points on the lattice. Since the global causal state is a sufficient statistic, the 
combination of all local causal states is too. □ 
Thus, remarkably enough, the local optimal predictors contain all the information necessary for global 
optimal prediction as well. There doesn't seem to be any way to make this proof work if the local predic- 
tors are not based on the full light-cones. 



4 Structure of the Field of States S((v,t))



Let us take a moment to recap. We began with a dynamic random field X on the network. From it, we 
have constructed another random field on the same network, the field of the local causal states S.
If we want to predict X, whether locally (Theorem 1) or globally (Theorem 3), it is sufficient to know
S. This situation resembles attractor reconstruction in nonlinear dynamics (Kantz and Schreiber, 1997),
state-space models in time series analysis (Durbin and Koopman, 2001), and hidden Markov models in
signal processing (Elliot et al., 1995). In each case, it helps, in analyzing and predicting our observations,
to regard them as distorted measurements of another, unseen set of state variables, which have their own
dynamics. Let us, therefore, call all these things "hidden-state models". 

The hidden state space always has a more tractable structure than the observations, e.g., the former is 
Markovian, or even deterministic, but the latter is not. Now, usually the nice structure is simply demanded 
by us a priori, and it is an empirical question whether the demand can be met, whether any hidden-state 
model with that structure can adequately account for our observations. However, in the case of time 
series, one can construct hidden states, analogous to those built in Section 3, and show that these always
have nice structural properties: they are always homogeneous Markov processes, and so forth (Knight,
1975; Shalizi and Crutchfield, 2001). This analogy gives us reason to hope that the causal states we have
constructed always have nice properties, which is indeed the case.

In this section, I am going to establish some of the structural properties of the field of causal states,
which involve the relations between states at different points. Section 4.1 shows how the state at one
point may be used to partially determine the state at other points. Section 4.2 applies those results to the
problem of designing a filter to estimate the state field S from the observation field X. Finally, Section 4.3
considers the Markov properties of the S field. Note that, if we tried to build a hidden-state model for each
vertex separately, the result would lack these spatial properties, as well as being a sub-optimal predictor
(see the end of Section 3.1).

4.1 Transition Properties

The basic idea motivating this section is that the local causal state at (v,t) should be an adequate summary 
of L^-((v,t)) for all purposes. We have seen that this is true for both local and non-local prediction. Here
we consider the problem of determining the state at another point (u,s), s ≥ t. In general, the past
light-cones of (v,t) and (u,s) will overlap. If we have S((v,t)), and want S((u,s)), do we need to know the
actual contents of the overlapping region, or can we get away with just knowing S((v,t))?

It seems clear (and we will see that it is true) that determining S((u,s)) will require knowledge of the
"new" observations, the ones which are in the past light-cone of (u,s) but not in L^-((v,t)). This data is
relevant to (u,s), but inaccessible at (v,t), and couldn't be summarized in S((v,t)). This data also needs a
name; let us call it the fringe seen when moving from (v,t) to (u,s). (See Figures || and ^.)

The goal of this section is to establish that S((u,s)) is completely determined by S((v,t)) and the fringe. 
The way I do this is to show that L^+((u,s)) and L^-((u,s)) are independent, conditional on S((v,t)) and the
fringe. This means that the latter two, taken together, are a sufficient statistic, and then I invoke Theorem 2.
An alternate strategy would be to consider a transducer whose inputs are fringes, read as the transducer
moves across the graph, and whose outputs are local states. Our problem then would be to show that
the transducer's transition rules can always be designed to ensure that the states returned track the actual
states. This way of framing the problem opens up valuable connections to the theory of spatial automata
and spatial languages (Lindgren et al., 1998), but I prefer a more direct approach by way of conditional
independence.

Even so, the proof is frankly tedious. First, I show that the desired property holds if we consider 
successive states at the same vertex. Next, I show that it holds for simultaneous states at neighboring 
vertices. Then I extend those results to relate states at points connected by an arbitrary path in space and 
time. Finally, I show that the result is independent of the precise path chosen to link the points. Sadly, I 
have not been able to find a simpler argument. 

4.1.1 Temporal Transitions 

We want to move forward in time one step, while staying at the same vertex. Call the point we start at 
(v,t), and its successor (v,t+1). The whole of the new future light-cone is contained inside the old future
light-cone, and vice versa for the past cones. So let's define the following variables:

$$ \begin{aligned}
N^- &= L^-((v,t+1)) \setminus L^-((v,t)) \\
M^+ &= L^+((v,t)) \setminus L^+((v,t+1));
\end{aligned} $$

N^- is the fringe. (Figure 4 pictures these regions.)
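
The time-forward fringe can be computed as a plain set difference of point sets. The sketch below assumes, as an illustration only, that the light-cone is determined by graph distance with propagation speed c and that both cones are truncated at a common earliest time; the function names and the path graph are hypothetical.

```python
from collections import deque

def graph_distances(adj, source):
    """Breadth-first distances from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def past_light_cone(adj, v, t, earliest, c=1):
    """Points (u, s) with earliest <= s <= t and d(u, v) <= c * (t - s)."""
    dist = graph_distances(adj, v)
    return {(u, s) for s in range(earliest, t + 1)
            for u, d in dist.items() if d <= c * (t - s)}

# Hypothetical example: path graph 0 - 1 - 2 - 3 - 4, moving from (2, 0) to (2, 1).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
old_cone = past_light_cone(adj, v=2, t=0, earliest=-2)
new_cone = past_light_cone(adj, v=2, t=1, earliest=-2)
fringe = new_cone - old_cone  # the points of N^-
```

On this example the old cone sits entirely inside the new one, and the fringe consists of the new point (2, 1) plus the points on the widened boundary of the cone.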

Lemma 3 The local causal state at (v,t + 1) is a function of the local causal state at (v,t) and the time- 
forward fringe N~. 

Proof. Start by drawing the diagram of effects (Figure ||).

M^+ and L^+((v,t+1)) jointly constitute L^+((v,t)), so there must be paths from S((v,t)) to both of
them. Now, S((v,t+1)) renders L^-((v,t+1)) and L^+((v,t+1)) conditionally independent. Hence it
should d-separate them in the graph of effects. But L^-((v,t)) is part of L^-((v,t+1)) and has a direct
path to S((v,t)). This means that there cannot be a direct path from S((v,t)) to L^+((v,t+1)); rather,
the path must go through S((v,t+1)). (We indicate this in the graph by a dotted arrow from S((v,t)) to
L^+((v,t+1)).) Similarly, L^-((v,t)) certainly helps determine S((v,t+1)), but it need not do so directly.

Fig. 4: Space-time regions for the time-forward transition from $\langle v,t\rangle$ to $\langle v,t+1\rangle$.
Black points: $L^-(\langle v,t\rangle)$, the past light-cone of $\langle v,t\rangle$. Dark grey: $N^-$, the part
of $L^-(\langle v,t+1\rangle)$, the past light-cone of $\langle v,t+1\rangle$, outside the past light-cone of
$\langle v,t\rangle$. Light grey: $L^+(\langle v,t+1\rangle)$, the future light-cone of $\langle v,t+1\rangle$.
White: $M^+$, the part of $L^+(\langle v,t\rangle)$ outside $L^+(\langle v,t+1\rangle)$.

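In fact, it cannot: $S(\langle v,t\rangle)$ must d-separate $L^-(\langle v,t\rangle)$ and
$L^+(\langle v,t\rangle)$, i.e., must d-separate $L^-(\langle v,t\rangle)$ from $L^+(\langle v,t+1\rangle)$ and
$M^+$. Hence the influence of $L^-(\langle v,t\rangle)$ on $S(\langle v,t+1\rangle)$ must run through
$S(\langle v,t\rangle)$. (We indicate this, too, by a dotted arrow from $L^-(\langle v,t\rangle)$ to
$S(\langle v,t+1\rangle)$.)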

Now it is clear that the combination of $S(\langle v,t\rangle)$ and $N^-$ d-separates
$L^-(\langle v,t+1\rangle)$ from $L^+(\langle v,t+1\rangle)$, and hence makes them conditionally independent.
But now the usual combination of Lemma [j] and Theorem^ tells us that there's a function from
$S(\langle v,t\rangle), N^-$ to $S(\langle v,t+1\rangle)$. □

4.1.2 Spatial Transitions 

Lemma 4 Let $\langle u,t\rangle$ and $\langle v,t\rangle$ be simultaneous, neighboring points. Then
$S(\langle v,t\rangle)$ is a function of $S(\langle u,t\rangle)$ and the fringe in the direction from
$\langle u,t\rangle$ to $\langle v,t\rangle$, $V^-$.

Here the breakdown of the past and future light-cone regions is the same as when we saw how to compose patch
causal states out of local causal states in Section 3.3, as is the influence diagram; we'll use the
corresponding terminology, too. (See Figures ||] and |[ respectively.) What we hope to show here is that
conditioning on the combination of $S(\langle u,t\rangle)$ and $V^-$ makes $L^+(\langle v,t\rangle)$ independent
of $V^-$ and $C^-$. Unfortunately, as the reader may verify by inspecting the diagram, our conditional variables
no longer d-separate the other variables (since they have an unblocked connection through
$S(\langle v,t\rangle)$). All is not lost, however: d-separation implies conditional independence, but not
conversely.

Fig. 5: Influence diagram for the variables involved in a time-forward transition. (Nodes:
$L^-(\langle v,t\rangle)$, $N^-$, $S(\langle v,t\rangle)$, $S(\langle v,t+1\rangle)$, $M^+$,
$L^+(\langle v,t+1\rangle)$.)

Abbreviate the pair of variables $\{S(\langle u,t\rangle), V^-\}$ by $Z$. Now, $S(\langle v,t\rangle)$ is a
(deterministic) function of $C^-$ and $V^-$. Hence it is also a function of $Z$ and $C^-$. Thus
$P(V^+ \mid S(\langle v,t\rangle), Z, C^-) = P(V^+ \mid Z, C^-)$. But this tells us that

$$V^+ \perp\!\!\!\perp S(\langle v,t\rangle) \mid Z, C^- \tag{33}$$

From d-separation, we also have

$$V^+ \perp\!\!\!\perp C^- \mid Z, S(\langle v,t\rangle) \tag{34}$$

Applying Eq. |n],

$$V^+ \perp\!\!\!\perp S(\langle v,t\rangle), C^- \mid Z \tag{35}$$

Applying Eq. |1J,

$$V^+ \perp\!\!\!\perp C^- \mid Z \tag{36}$$

Since $Z = Z, V^-$,

$$V^+ \perp\!\!\!\perp C^- \mid Z, V^- \tag{37}$$

The following conditional independence is odd-looking, but trivially true:

$$V^+ \perp\!\!\!\perp V^- \mid Z \tag{38}$$

And it, along with Eq. 37, gives us

$$V^+ \perp\!\!\!\perp C^-, V^- \mid Z \tag{39}$$


A similar train of reasoning holds for $C^+$. Thus, the entire future light cone of $\langle v,t\rangle$ is
independent of that point's past light cone, given $S(\langle u,t\rangle)$ and $V^-$. This tells us that
$\{S(\langle u,t\rangle), V^-\}$ is sufficient for $L^+(\langle v,t\rangle)$, hence $S(\langle v,t\rangle)$ is a
function of it. □

4.1.3 Arbitrary Transitions 

Lemma 5 Let $\langle u,t\rangle$ and $\langle v,s\rangle$ be two points, $s > t$. Let $\Gamma$ be a
spatio-temporal path connecting the two points, arbitrary except that it can never go backwards in time. Let
$F_\Gamma$ be the succession of fringes encountered along $\Gamma$. Then $S(\langle v,s\rangle)$ is a function of
$S(\langle u,t\rangle)$, $\Gamma$ and $F_\Gamma$,

$$S(\langle v,s\rangle) = g(S(\langle u,t\rangle), \Gamma, F_\Gamma) \tag{40}$$

for some function $g$.

Proof. Apply Lemma 3 or 4 at each step of $\Gamma$. □

Theorem 4 Let $\langle u,t\rangle$ and $\langle v,s\rangle$ be two points as in the previous lemma, and let
$\Gamma_1$, $\Gamma_2$ be two paths connecting them, and $F_{\Gamma_1}$ and $F_{\Gamma_2}$ their fringes, all as
in the previous lemma. Then the state at $\langle v,s\rangle$ is independent of which path was taken to reach it,

$$g(S(\langle u,t\rangle), \Gamma_1, F_{\Gamma_1}) = g(S(\langle u,t\rangle), \Gamma_2, F_{\Gamma_2}) . \tag{41}$$
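
Proof. Suppose otherwise. Then either the state we get by going along $\Gamma_1$ is wrong, i.e., isn't
$S(\langle v,s\rangle)$, or the state we get by going along $\Gamma_2$ is wrong, or both are.

$$S(\langle v,s\rangle) \neq g(S(\langle u,t\rangle), \Gamma_1, F_{\Gamma_1}) \;\vee\; S(\langle v,s\rangle) \neq g(S(\langle u,t\rangle), \Gamma_2, F_{\Gamma_2}) \tag{42}$$

$$L^+(\langle v,s\rangle) \not\perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_1, F_{\Gamma_1} \;\vee\; L^+(\langle v,s\rangle) \not\perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_2, F_{\Gamma_2} \tag{43}$$

$$\neg\left( L^+(\langle v,s\rangle) \perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_1, F_{\Gamma_1} \;\wedge\; L^+(\langle v,s\rangle) \perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_2, F_{\Gamma_2} \right) \tag{44}$$

But, by Lemma 5, $L^+(\langle v,s\rangle) \perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_1, F_{\Gamma_1}$
and $L^+(\langle v,s\rangle) \perp\!\!\!\perp L^-(\langle v,s\rangle) \mid S(\langle u,t\rangle), \Gamma_2, F_{\Gamma_2}$.
Hence transitions must be path-independent. □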

4.2 Recursive Filtering 

Because the causal states are logical constructions out of observational data, in principle the state at any 
point can be determined exactly from looking at the point's past light-cone. It may be, however, that 
for some fields one needs to look infinitely far back into the past to fix the state, or at least further back 
than the available data reaches. In such cases, one will generally have not exact knowledge of the state,
but either a set-valued estimate (i.e., $S \in \mathcal{S}$, for some set of states $\mathcal{S}$), or a
distribution over states, supposing one has a meaningful prior distribution.

A naive state-estimation scheme, under these circumstances, would produce an estimate for each point 
separately. The transition properties we have just proved, however, put deterministic constraints on which 
states are allowed to co-exist at different points. In particular, we can narrow our estimates of the state at 
each point by requiring that it be consistent with our estimates at neighboring points. Applied iteratively, 
this will lead to the tightest estimates which we can extract from our data. If we can fix the state at even 
one point exactly, then this will propagate to all points at that time or later. 
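
As an illustration of how such consistency requirements tighten set-valued estimates, the following Python sketch handles only the simplest case: successive states at a single vertex, linked by the deterministic temporal transition of Lemma 3. Everything concrete here is an assumption made for the example: the transition table `step`, the state labels, the fringe symbols, and the function name `narrow` are hypothetical and not taken from the paper.

```python
def narrow(candidates, fringes, step):
    """Tighten set-valued state estimates along one vertex's time line.

    candidates[i] : set of states still considered possible at time i
    fringes[i]    : fringe observed between time i and time i + 1
    step[(s, f)]  : the unique successor of state s under fringe f
                    (a hypothetical, known transition table)
    """
    cands = [set(c) for c in candidates]
    changed = True
    while changed:  # sets only shrink, so this terminates
        changed = False
        for i, f in enumerate(fringes):
            # Forward: a later state must be the successor of a surviving earlier state.
            fwd = {step[(s, f)] for s in cands[i] if (s, f) in step} & cands[i + 1]
            # Backward: an earlier state must lead to a surviving later state.
            bwd = {s for s in cands[i] if step.get((s, f)) in fwd}
            if fwd != cands[i + 1] or bwd != cands[i]:
                cands[i + 1], cands[i] = fwd, bwd
                changed = True
    return cands


# Toy two-state table: fringe 1 flips the state, fringe 0 keeps it.
step = {("A", 0): "A", ("B", 0): "B", ("A", 1): "B", ("B", 1): "A"}
estimates = narrow([{"A", "B"}, {"A", "B"}, {"A"}], fringes=[1, 0], step=step)
```

In this toy table the transition is invertible, so fixing the state at the last time step also pins down the earlier ones; in general only the forward direction is guaranteed to be deterministic.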

More generally, rather than first doing a naive estimate and then tightening it, we can incorporate the 
deterministic constraints directly into a recursive filter. Under quite general circumstances, as time goes 
on, the probability that such a filter will not have fixed the state of at least one point goes to zero, and once 
it fixes on a state, it stays fixed. Such filters can be implemented, at least for discrete fields, by means of
finite-state transducers; see Shalizi (2001, Chapter 10) for examples (further specialized to discrete fields
on regular lattices).

4.3 Markov Properties 



Recursive estimation is a kind of Markov property (Nevel'son and Has'minskii, 1972/1976), and the fact
that it can work exactly here is very suggestive. So, too, is the fact that the past and the future light-cones
are independent, conditional on the causal state. It would be very nice if the causal states formed a Markov
random field; for one thing, we could then exploit the well-known machinery for such fields. For instance,
we would know, from the equivalence between Gibbs distributions and Markov fields (Guyon, 1995), that
there was an effective potential for the interactions between the states across the network.

Definition 4 (Parents of a Local Causal State) The parents of the local causal state at $\langle v,t\rangle$
are the causal states at all points which are one time-step before $\langle v,t\rangle$ and inside its past
light-cone:

$$A(\langle v,t\rangle) = \{ S(\langle u,t-1\rangle) \mid (u = v) \vee (u E v) \} \tag{45}$$
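
A small Python sketch of which points supply the parents may help fix ideas. It assumes that the relation $u E v$ in Eq. 45 means "u and v share an edge" and that information moves at most one hop per time step; the adjacency dictionary and the function name `parent_points` are hypothetical.

```python
def parent_points(adj, v, t):
    """Points whose causal states are the parents of the state at (v, t).

    Assumes `adj[v]` lists the graph neighbours of v and a propagation
    speed of one hop per time step.
    """
    return {(u, t - 1) for u in [v, *adj[v]]}


# Hypothetical 5-vertex path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
parents_of_middle = parent_points(path, v=2, t=5)
```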

Lemma 6 The local causal state at a point, $S(\langle v,t\rangle)$, is independent of the configuration in its
past light cone, given its parents.

$$S(\langle v,t\rangle) \perp\!\!\!\perp L^-(\langle v,t\rangle) \mid A(\langle v,t\rangle) \tag{46}$$

Proof. $\langle v,t\rangle$ is in the intersection of the future light cones of all the nodes in the patch at
$t-1$. Hence, by the arguments given in the proof of the composition theorem, it is affected by the local
states of all those nodes, and by no others. In particular, previous values of the configuration in
$L^-(\langle v,t\rangle)$ have no direct effect; any influence must go through those nodes. Hence, by
d-separation, $S(\langle v,t\rangle)$ is independent of $L^-(\langle v,t\rangle)$. □

Theorem 5 (Temporal Markov Property) The local causal state at a point, $S(\langle v,t\rangle)$, is independent
of the local causal states of points in its past light cone, given its parents.

Proof. By the previous lemma, $S(\langle v,t\rangle)$ is conditionally independent of $L^-(\langle v,t\rangle)$
given its parents. But the local causal states in its past light cone are a function of
$L^-(\langle v,t\rangle)$. Hence by Equation 16 $S(\langle v,t\rangle)$ is also independent of those local
states. □

Comforting though that is, we would really like a stronger Markov property, namely the following.

Conjecture 1 (Markov Field Property) The local causal states form a Markov field in space-time. 

Argument for why this is plausible. We've seen that, temporally speaking, a Markov property holds: given 
a patch of nodes at one time, they are independent of their past light cone, given their causal parents. What 
we need to add for the Markov field property is that, if we condition on present neighbors of the patch, as 
well as the parents of the patch, then we get independence of the states of all points at time t or earlier. 
It's plausible that the simultaneous neighbors are informative, since they are also causal descendants of 
causal parents of the patch. But if we consider any more remote node at time t, its last common causal 
ancestor with any node in the patch must have been before the immediate parents of the patch, and the 
effects of any such local causal state are screened off by the parents. 

Unfortunately, this is not really rigorous. In the somewhat specialized case of discrete fields on regular 
lattices, a number of systems have been checked, and in all cases the states have turned out to be Markov 
random fields. 



5 State Reconstruction for Discrete Fields 

The local causal states are a particular kind of hidden-state model; they combine optimal prediction
properties (Section ||) with the nice structural properties of hidden-state models (Section |j). It is only
human to be tempted to use them on actual systems. In the special case of discrete-valued fields, one can
reliably identify the causal states from empirical data, with minimal assumptions. This section describes an
algorithm for doing so, building on earlier procedures for cellular automata (Shalizi and Shalizi, 2003). As
always, I take the structure of the network to be known and constant.

Assume that for each point we have estimates of the joint probability distribution $P(L^+, L^-)$ over
light-cones. (These could come, for instance, from the empirical distribution in an ensemble of networks with
the same structure, or from the same network observed over time, if certain ergodicity assumptions can be
made.) The resulting conditional distributions, $P(L^+ \mid l^-)$, can be treated as multinomials. We then
cluster past configurations, point by point, based on the similarity of their conditional distributions. We
cannot expect that the estimated distributions will match exactly, so we employ a standard test, e.g.
$\chi^2$, to see whether the discrepancy between estimated distributions is significant. These clusters are
then the estimated local causal states. We consider each cluster to have a conditional distribution of its
own, equal to the weighted mean of the distributions of the pasts it contains.
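
The bookkeeping in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observations arrive as hypothetical `(past, future)` pairs of hashable light-cone configurations for a single vertex, and the "weighted mean" is taken with weights proportional to how often each past was observed, which makes it equal to the normalized pooled counts. The function names are illustrative.

```python
from collections import Counter, defaultdict


def tabulate(pairs):
    """Count the futures seen after each past; `pairs` yields (past, future)."""
    counts = defaultdict(Counter)
    for past, future in pairs:
        counts[past][future] += 1
    return counts


def conditional(counter):
    """Normalize a Counter of futures into an empirical distribution."""
    n = sum(counter.values())
    return {future: k / n for future, k in counter.items()}


def cluster_distribution(counts, members):
    """Observation-weighted mean of the members' conditional distributions.

    Pooling the raw counts and normalizing gives exactly that mean.
    """
    pooled = Counter()
    for past in members:
        pooled += counts[past]
    return conditional(pooled)


# Hypothetical data: past "p" seen 4 times, past "q" seen twice.
pairs = [("p", "x")] * 3 + [("p", "y")] + [("q", "x"), ("q", "y")]
counts = tabulate(pairs)
dist = cluster_distribution(counts, ["p", "q"])
```

Here past "p" has estimated distribution (0.75, 0.25) with weight 4/6 and past "q" has (0.5, 0.5) with weight 2/6, so the cluster assigns probability 2/3 to future "x".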

U <- list of all pasts in random order
Move the first past in U to a new state
for each past in U
    noMatch <- TRUE
    state <- first state on the list of states
    while (noMatch and more states to check)
        noMatch <- (Significant difference between past and state?)
        if (noMatch)
            state <- next state on the list
        else
            Move past from U to state
            noMatch <- FALSE
    if (noMatch)
        make a new state and move past into it from U

Fig. 6: Algorithm for grouping past light-cones into estimated states

As a practical matter, we need to impose a limit on how far back into the past, or forward into the future,
the light-cones are allowed to extend: their depth. Also, clustering cannot be done on the basis of a true
equivalence relation. Instead, we list the past configurations $\{l_i^-\}$ in some arbitrary order. We then
create a cluster which contains the first past, $l_1^-$. For each later past, say $l_i^-$, we go down the list
of existing clusters and check whether the estimated $P(L^+ \mid l_i^-)$ differs significantly from each
cluster's distribution. If there is no difference, we add $l_i^-$ to the first matching cluster and update the
latter's distribution. If $l_i^-$ does not match any existing cluster, we make a new cluster for $l_i^-$. (See
Figure 6 for pseudo-code.) As we give this procedure more and more data, it converges in probability on the
correct set of causal states, independent of the order in which we list past light-cones (see below). For
finite data, the order of presentation matters, but we finesse this by randomizing the order.
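
The grouping procedure can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation, and it makes several assumptions: the input is a hypothetical dictionary mapping each past configuration to a `Counter` of the futures observed after it; the significance test is a two-sample $\chi^2$ test of homogeneity; and, to stay within the standard library, the critical value is obtained from the Wilson-Hilferty approximation rather than an exact quantile.

```python
import random
from collections import Counter
from statistics import NormalDist


def chi2_critical(df, alpha):
    """Approximate upper-alpha chi-squared critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


def differs(c1, c2, alpha=0.05):
    """Two-sample chi-squared homogeneity test on two Counters of futures."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    keys = set(c1) | set(c2)
    if len(keys) < 2:
        return False  # only one future ever observed: nothing to distinguish
    stat = 0.0
    for k in keys:
        total = c1[k] + c2[k]
        for observed, n in ((c1[k], n1), (c2[k], n2)):
            expected = total * n / (n1 + n2)
            stat += (observed - expected) ** 2 / expected
    return stat > chi2_critical(len(keys) - 1, alpha)


def cluster_pasts(future_counts, alpha=0.05, seed=0):
    """Group pasts whose future distributions do not differ significantly."""
    pasts = list(future_counts)
    random.Random(seed).shuffle(pasts)  # randomize the order of presentation
    states = []  # each state: {"pasts": [...], "counts": pooled Counter}
    for past in pasts:
        for state in states:
            if not differs(future_counts[past], state["counts"], alpha):
                state["pasts"].append(past)
                state["counts"] += future_counts[past]  # update the cluster's distribution
                break
        else:  # no existing state matched
            states.append({"pasts": [past], "counts": Counter(future_counts[past])})
    return states


# Hypothetical counts: pasts "a" and "b" predict alike, "c" does not.
data = {
    "a": Counter({0: 90, 1: 10}),
    "b": Counter({0: 88, 1: 12}),
    "c": Counter({0: 10, 1: 90}),
}
states = cluster_pasts(data)
```

With these counts the pasts "a" and "b" end up in one estimated state and "c" in another, whatever order the shuffle produces. Pooling counts inside a state is one way to realize the update of the cluster's distribution.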

Suppose that the past and the future light-cones are sufficiently deep that they suffice to distinguish the 
causal states, i.e., that if we had the exact distribution over light-cones, the true causal states would coin- 
cide exactly with those obtained from the distribution over limited-depth light-cones. Then, conditioning 
on the limited-depth past cones makes futures independent of the more remote past. Indeed, every time 
we examine the future of a given past, we take an independent sample from an unchanging distribution 
over futures. Thus, the strong law of large numbers tells us that the empirical probability of any future 
configuration will converge on the true probability almost surely. Since, with finite future depth, there 
are only finitely many possible future configurations, the conditional distribution as a whole also con- 
verges almost surely. And, if there are only finitely many vertices, we get over-all convergence of all the 
necessary conditional distributions. 

For this analysis to hold, we need to know three things. 

1. The structure of the graph.

2. The constant c which is the maximum speed at which information can propagate. 

3. The minimum necessary depth of past and future cones. 

Note that if we over-estimate (2) and (3), we will not really harm ourselves, but under-estimates are
harmful, since in general they will miss useful bits of predictive information.

Note that this analysis, like the algorithm, is somewhat crude, in that it doesn't make use of any of 
the properties of the causal states we established earlier. In the analogous case of time series, using those 
properties can greatly speed up the reconstruction process, and allows us to not merely prove convergence, 
but to estimate the rate of convergence (Shalizi et al., 2002). Probably something like that could be done
here, with the additional complication that each vertex must get its own set of states. 



6 Conclusion 

The purpose of this paper has been to define, mathematically, optimal nonlinear predictors for a class of 
complex systems, namely dynamic random fields on fixed, undirected graphs. Starting with the basic idea 
of maximizing the predictive information, we constructed the local causal states of the field, which are 
minimal sufficient statistics, and so the simplest possible basis for optimal prediction. We then examined 
how to combine these states for non-local prediction, and the structure of the causal-state field. The last 
section described an algorithm for reconstructing the states of discrete fields. 